Overview and Motivation

Sabermetrics has become quite popular, especially in more recent years. Researchers have developed metrics that are widely used to gauge player performance. The search for more reliable and accurate measures is still ongoing. We will begin our search by analyzing and comparing an increasingly popular but not officially recognized metric: wins above replacement (WAR). There is no universal way of calculating WAR, so we assess its accuracy across three different sources. In doing so, we also aim to derive a more “comprehensive” metric, perhaps one that is a combination of the three to improve results.

Initial Questions

Some initial questions we wish to answer include: - What is the relationship between WAR and the number of wins a team sees? - How accurately does WAR predict the number of wins a team observes in a given year? - How can the three different ways to calculate WAR be combined to produce a more reliable statistic? - What are some efficient techniques that one could use to scrape data from Fangraphs, Baseball Prospectus, and Baseball Reference, all of which are commonly visited websites in the baseball world?

Data

Data are obtained from the Teams dataset in the Lahman package and from three external sources: Baseball Prospectus, Baseball Reference, and Fangraphs. We scraped data from the three sources using various techniques that are described below. Extensive cleaning had to be done to arrive at a complete and usable dataset.

Baseball Prospectus

To calculate WARP, Baseball Prospectus uses three proprietary statistics: Batting Runs Above Replacement (BRAR), Fielding Runs Above Replacement (FRAR), and Pitching Runs Above Replacement (PRAR).

The three numbers are added and divided by the number of runs per win that season, which is another proprietary number. According to Baseball Reference, this number is around 10.

Baseball Prospectus provides player, team, and year-specific WAR statistics. Data (as shown above) are displayed annually from 1871 to 2017 with a drop down menu for user selection.

To scrape year-specific data, we vectorized over the range of years and used the function set_values and submit_form to automate the selection process and extracted the table of interest by identifying the corresponding html_node.

url_batting <- "http://legacy.baseballprospectus.com/sortable/index.php?cid=2022181"
url_pitching <- "http://legacy.baseballprospectus.com/sortable/index.php?cid=1931167"

prospectus <- function(year, url){
  year = as.character(year)
  pgsession <- html_session(url)
  pgform <- html_form(pgsession)[[3]]
  
  #Fill in selection form
  filled_form <-set_values(pgform,
                           "year" = year
  )
  
  #Submit selection form
  d <- submit_form(session = pgsession, form = filled_form)
  dat <- d %>%
    html_nodes("table") %>%
    .[[5]] %>%
    html_table(header = Truth)
  dat
}

#Submit Year = X for X in [1871, 2017] to obtain year-specific data
prospectus_batting <- do.call(rbind, lapply(1871:2017, prospectus, url = url_batting))
prospectus_pitching <- do.call(rbind, lapply(1871:2017, prospectus, url = url_pitching))

#Reformat data 
prospectus_pitching <- read.csv("prospectus_pitching_full.csv")
prospectus_batting <- read.csv("prospectus_batting_full.csv")

prospectus_pitching_small <- prospectus_pitching[, c("NAME", "TEAM", "YEAR", "PWARP")]
names(prospectus_pitching_small) <- c("Name", "Team", "Year", "WARP")
prospectus_batting_small <- prospectus_batting[, c("NAME", "TEAM", "YEAR", "BWARP")]
names(prospectus_batting_small) <- c("Name", "Team", "Year", "WARP")

Baseball Reference

Baseball Reference defines rWAR as

\[ rWAR = (\mbox{Player_Runs - AvgPlayer_Runs) + (AvgPlayer_runs - ReplacePlayer_runs}) \]

rWAR is based on six different components: (1) batting runs, (2) baserunning runs, (3) runs added or lost due to grounding into double plays in double play situations, (4) fielding runs, (5) positional adjustment runs, and (6) replacement level runs. The first five components are compared against the league average, meaning an “average” league player is assigned a value of zero. It follows that negative numbers correspond to worse than average and positive numbers better than average. These five components make up the first half of the equation. The last component, replacement level runs, makes up the second part of the equation.

Data are provided for each team in different years. Separate tables are displayed for batters and pitchers. The screenshot above provides an example of one of the tables shown for the New York Yankees in 2017.

Links to these tables follow a general pattern, so a vector of all the possible links was created. We then looped through all of these links to grab the tables from each page. Final touches were made to get the datasets into the right format.

teams <- unique(Teams$franchID)
years <- 1871:2017

urls <- matrix(0, length(teams), length(years))
for(i in 1:length(teams)) {
  for(j in 1:length(years)) {
    urls[i, j] <- paste0("https://www.baseball-reference.com/teams/", teams[i], "/", years[j], ".shtml")
  }
}
url_vector <- as.vector(urls)

list_of_batting <- list()
list_of_pitching <- list()
for(i in 1:5000) {
  url <- url_vector[i]
  
  res <- try(readLines(url), silent = TRUE)
  
  ## check if website exists
  if(inherits(res, "try-error")) {
    list_of_batting[[i]] <- NA
    list_of_pitching[[i]] <- NA
  }
  else {
    urltxt <- readLines(url)
    urltxt <- gsub("-->", "", gsub("<!--", "", urltxt))
    doc <- htmlParse(urltxt)
    tables_full <- readHTMLTable(doc)
    tmp1 <- tables_full$players_value_batting
    tmp2 <- tables_full$players_value_pitching
    list_of_batting[[i]] <- tmp1
    list_of_pitching[[i]] <- tmp2
  }
  print(i)
  closeAllConnections()
}

for(i in 5001:10000) {
  url <- url_vector[i]
  
  res <- try(readLines(url), silent = TRUE)
  
  ## check if website exists
  if(inherits(res, "try-error")) {
    list_of_batting[[i]] <- NA
    list_of_pitching[[i]] <- NA
  }
  else {
    urltxt <- readLines(url)
    urltxt <- gsub("-->", "", gsub("<!--", "", urltxt))
    doc <- htmlParse(urltxt)
    tables_full <- readHTMLTable(doc)
    tmp1 <- tables_full$players_value_batting
    tmp2 <- tables_full$players_value_pitching
    list_of_batting[[i]] <- tmp1
    list_of_pitching[[i]] <- tmp2
  }
  print(i)
  closeAllConnections()
}

for(i in 10001:17640) {
  url <- url_vector[i]
  
  res <- try(readLines(url), silent = TRUE)
  
  ## check if website exists
  if(inherits(res, "try-error")) {
    list_of_batting[[i]] <- NA
    list_of_pitching[[i]] <- NA
  }
  else {
    urltxt <- readLines(url)
    urltxt <- gsub("-->", "", gsub("<!--", "", urltxt))
    doc <- htmlParse(urltxt)
    tables_full <- readHTMLTable(doc)
    tmp1 <- tables_full$players_value_batting
    tmp2 <- tables_full$players_value_pitching
    list_of_batting[[i]] <- tmp1
    list_of_pitching[[i]] <- tmp2
  }
  print(i)
  closeAllConnections()
}

## find indices where the link exists
ind_batting <- which(!is.na(list_of_batting))
ind_pitching <- which(!is.na(list_of_pitching))

## find links that exist
url_batting <- url_vector[ind_batting]
url_pitching <- url_vector[ind_pitching]

## extract year from each url
years_batting <- as.numeric(unlist(regmatches(url_batting, gregexpr("[[:digit:]]+", url_batting))))
years_pitching <- as.numeric(unlist(regmatches(url_pitching, gregexpr("[[:digit:]]+", url_pitching))))

## extract team from each url
teams_batting <- basename(dirname(url_batting))
teams_pitching <- basename(dirname(url_pitching))

## remove NAs from lists
na.omit.list <- function(y) { 
  return(y[!sapply(y, function(x) all(is.na(x)))]) 
}

test_batting <- na.omit.list(list_of_batting)
test_pitching <- na.omit.list(list_of_pitching)

## add columns for year and team
test_batting <- mapply(cbind, test_batting, "Year" = years_batting, 
                       "Team" = teams_batting, SIMPLIFY = F)
test_pitching <- mapply(cbind, test_pitching, "Year" = years_pitching,
                        "Team" = teams_pitching, SIMPLIFY = F)

bbref_batting <- bind_rows(test_batting)
bbref_pitching <- bind_rows(test_pitching)

Exploratory Analysis

What visualizations did you use to look at your data in different ways? What are the different statistical methods you considered? Justify the decisions you made, and show any major changes to your ideas. How did you reach these conclusions?

Combine data sets

#Read in BBProspectus data and remove extraneous columns
BBProspectus <- read.csv("prospectus.csv")
BBProspectus <- BBProspectus %>%
    rename(BBproWAR = WARP, teamID = franchID) %>%
    select(c(-X, -W))

#Baseball Prospectus had some funny names, so first we need to align them with our other datasets using Lahman
BBProspectus$franchID = NA

BBProspectus = BBProspectus %>%
    filter(yearID >= 1949)

for (i in 1:length(BBProspectus[, 1]))
{
    
    BBProspectus[i, ]$franchID = 
        as.character(unique(
            Teams$franchID[as.character(Teams$teamID) == as.character(BBProspectus[i, ]$teamID)])[1])
}    

BBProspectus = BBProspectus %>%
    select(c(-teamID))

#Read in BBProspectus data and remove extraneous columns
BBReference <- read.csv("BBReference.csv")
BBReference <- BBReference %>%
    filter(yearID >= 1949) %>%
    rename(BRefWAR = WAR) %>%
    select(c(-X, -source, -W))

BBReference = BBReference[!is.na(BBReference$BRefWAR), ]
#Baseball Reference Team IDs are based on the team of that year, need to correct this for a few teams
alts = c("LAA", "CAL", "TBR", "MIA", "PHA")
delt = c("ANA", "ANA", "TBD", "FLA", "OAK")

for (i in 1:length(BBReference[, 1]))
{
    if ((BBReference[i, ]$franchID %in% alts))
    {
        BBReference[i, ]$franchID =  delt[match(BBReference[i, ]$franchID, alts)]
    }
}

FanGraphs <- read.csv("Conor/FanGraphWARs_better_format.csv")
FanGraphs <- FanGraphs %>%
    rename(FanGraphWAR = WAR, yearID=year) %>%
    select(c(-X))

#Inner join the three for a complete data set
data <- inner_join(BBReference, BBProspectus, by = c("franchID", "yearID"))
data = inner_join(data, FanGraphs, by =c("franchID", "yearID"))

#Join combined data with true number of wins in Lahman's dataset to create wide data set 
team_sub <- Teams %>%
    select(yearID, franchID, W, G)  %>%
    rename(Truth = W, Games = G)
data_wide <- left_join(data, team_sub, by = c("franchID", "yearID"))

data_wide = data_wide %>%
    filter(yearID %in% 1949:2016) %>%
    mutate(BRefWAR = BRefWAR + Games * 0.294, FanGraphWAR = FanGraphWAR + Games * 0.294, BBproWAR = BBproWAR + Games*0.320) %>%
    rename(BRefPred = BRefWAR, FanGraphPred = FanGraphWAR, BBproPred = BBproWAR)

data_long <- gather(data_wide, source, WAR, -c(franchID, yearID)) ##TODO do we need this? -Conor

How well does WAR compare to the true number of wins?

MSE(data_wide$Truth, data_wide$FanGraphPred)
## [1] 6.239231
MSE(data_wide$Truth, data_wide$BBproPred)
## [1] 9.256546
MSE(data_wide$Truth, data_wide$BRefPred)
## [1] 5.136446
statistics_matrix = matrix(nrow = 3, ncol = 3)
rownames(statistics_matrix) = c("Fan Graph", "Baseball Prospectus", "Baseball Reference")
colnames(statistics_matrix) = c("Mean Difference", "Standard Deviation", "Slope")

FG_diff = data_wide$FanGraphPred - data_wide$Truth
statistics_matrix[1,1] = mean(FG_diff)
statistics_matrix[1,2] = sd(FG_diff)
statistics_matrix[1,3] = lm(Truth ~ FanGraphPred, data = data_wide)$coef[[2]]

BBP_diff = data_wide$BBproPred - data_wide$Truth
statistics_matrix[2,1] = mean(BBP_diff)
statistics_matrix[2,2] = sd(BBP_diff)
statistics_matrix[2,3] = lm(Truth ~ BBproPred, data = data_wide)$coef[[2]]

BRef_diff = data_wide$BRefPred - data_wide$Truth
statistics_matrix[3,1] = mean(BRef_diff)
statistics_matrix[3,2] = sd(BRef_diff)
statistics_matrix[3,3] = lm(Truth ~ BRefPred, data = data_wide)$coef[[2]]

show(statistics_matrix)
##                     Mean Difference Standard Deviation     Slope
## Fan Graph                 2.5455374           5.698188 1.0017252
## Baseball Prospectus       2.8262118           8.817408 0.6918975
## Baseball Reference        0.2340364           5.132780 0.9863421
diff2 = cbind(FG_diff, BBP_diff, BRef_diff)
diff2 = as.data.frame(diff2)

diff2 %>% ggplot() + 
    geom_histogram(aes(x=FG_diff, fill="Fan Graph"), alpha = 0.2, breaks = seq(-20, 20, 0.5)) +
    geom_histogram(aes(x=BBP_diff, fill="Baseball Prospectus"), alpha = 0.2, breaks = seq(-20, 20, 0.5)) +
    geom_histogram(aes(x=BRef_diff, fill="Baseball Reference"), alpha = 0.2, breaks = seq(-20, 20, 0.5)) +
    ggtitle("Difference Between Predicted and True Wins") +
    xlab("Difference") +
    ylab("Count") +
    scale_fill_manual(values = c("Fan Graph" = rgb(1,0,0), "Baseball Prospectus" = rgb(0,1,0), "Baseball Reference"=rgb(0,0,1)))

data_wide %>% ggplot(aes(x = Truth, y = FanGraphPred)) +  
    geom_point(color="red", size = 1) +
    geom_abline(slope = 1, intercept = 0) +
    ggtitle("Fan Graph True v Predicted Wins") +
    xlab("True Wins") +
    ylab("Predicted Wins") +
    xlim(40, 125) +
    ylim(40, 125)

data_wide %>% ggplot(aes(x = Truth, y = BBproPred)) +  
    geom_point(color="green", size = 1) +
    geom_abline(slope = 1, intercept = 0) +
    ggtitle("Baseball Prospectus True v Predicted Wins") +
    xlab("True Wins") +
    ylab("Predicted Wins") +
    xlim(40, 125) +
    ylim(40, 125)

data_wide %>% ggplot(aes(x = Truth, y = BRefPred)) +  
    geom_point(color="blue", size = 1) +
    geom_abline(slope = 1, intercept = 0) +
    ggtitle("Baseball Reference True v Predicted Wins") +
    xlab("True Wins") +
    ylab("Predicted Wins") +
    xlim(40, 125) +
    ylim(40, 125)

data_wide %>% ggplot() +
    geom_point(aes(x = Truth, y = FanGraphPred, color="Fan Graph"), alpha = 0.5) + 
    geom_point(aes(x = Truth, y = BBproPred, color="Baseball Prospectus"), alpha = 0.5) + 
    geom_point(aes(x = Truth, y = BRefPred, color="Baseball Reference"), alpha = 0.5) +
    scale_fill_manual(values = c("Fan Graph" = rgb(1,0,0), "Baseball Prospectus" = rgb(0,1,0), "Baseball Reference"=rgb(0,0,1))) +
    geom_abline(slope = 1, intercept = 0) +
    ggtitle("True v Predicted Wins") +
    xlab("True Wins") +
    ylab("Predicted Wins") +
    xlim(40, 125) +
    ylim(40, 125)

From this plot, it appears that both rWAR and WARP capture the trend of true wins over time. However, there is a substantial bias because both WAR statistics are hugely underestimating the truth.

Next, let’s build a naive model to capture the effect of WAR on predicting the truth for each of the 3 WAR statistics.

#Linear regression for rWAR and WARP, respectively
m1 <- lm(Truth ~ rWAR, data = data_wide)
m2 <- lm(Truth ~ WARP, data = data_wide)

#Make predictions based on model fit
pred <- predict(m1, newdata = data_wide)
pred2 <- predict(m2, newdata = data_wide)

#Wrangle data for plotting
df_predRef <- cbind(data_wide, rWARpred = pred)
a2 <- df_predRef %>%
  group_by(yearID) %>%
  summarize(sum.rWARpred = sum(rWARpred))

df_predPros <- cbind(data_wide, WARPpred = pred2)
a3 <- df_predPros %>%
  group_by(yearID) %>%
  summarize(sum.WARPpred = sum(WARPpred))

p <- ggplot() + 
  geom_line(data = data_sum[data_sum$source == "Truth", ], aes(x = yearID, y = sumWAR, group = source, color = source)) +
  geom_line(data = a2, aes(x = yearID, y = sum.rWARpred, color = "rWAR Pred")) + 
  geom_line(data = a3, aes(x = yearID, y = sum.WARPpred, color = "WARP Pred")) + 
  ggtitle("Number of Wins vs. Fitted Values from Naive Model by Year") +
  xlab("Year") +
  ylab("Number of Wins")
p


a <- data_wide %>%
  group_by(yearID) %>%
  summarize(sumW = sum(Truth))

MSE(a$sumW, a2$sum.rWARpred)
MSE(a$sumW, a3$sum.WARPpred)

According to RMSE, rWAR is a better predictor than WARP. However, we can definitely improve upon the naive model by accounting for the substructure within our data, e.g. team effect, team x year interaction effect, etc.

data_long <- gather(data_wide, Pred, Val, -c(franchID, yearID, Truth, Games))
data_long2 <- data_long %>%
  mutate(Residuals = (Val - Truth))

gg2 <- data_long2 %>%
  ggplot(aes(x = franchID, y = Residuals)) +
  geom_boxplot(fill = "lightblue") +
  facet_wrap(~ Pred, ncol = 1) +
  theme(axis.text.x = element_text(angle = 90)) +
  xlab("Team Name") +
  ggtitle("")
gg2

Here, we plot the residuals (WAR preditions - true number of wins) across different teams for the 3 WAR statistics WARP (top), rWAR (middle), and fWAR (bottom). To assess degree of fit, we visually examine these boxplots, with boxplots centered around 0 suggesting good fit.

From these residual boxplots, we can see that Baseball Prospectus has the largest residuals (longest boxes) across the different teams. Moreover, it also has many outliers, which may dilute signal. On the other hand, Baseball Reference and Fangraph’s WAR statistics perform relatively well, with the boxplots centered around 0.

Final Analysis

What did you learn about the data? How did you answer the questions? How can you justify your answers?

set.seed(123)
k <- 10
#Perform k-fold cross validation
folds <- createFolds(1:nrow(data_wide), k)

#Define function for performing gradient-free optimization
fn <- function(x){
  ref <- x[1]
  pros <- x[2]
  fan <- x[3]
  pred <- ref*data_wide$BRefPred + pros*data_wide$BBproPred + fan*data_wide$FanGraphPred 
  MSE(pred, data_wide$Truth)
}

#Initialize output matrix
MSE_val <- matrix(NA, 4, k)
coeffs <- matrix(NA, 3, k)

for(i in 1:k){
  test_df <- data_wide[folds[[i]],]
  train_df <- data_wide[-folds[[i]],]
  
  m1 <- lm(Truth ~ BRefPred, data = train_df)
  m2 <- lm(Truth ~ BBproPred, data = train_df)
  m3 <- lm(Truth ~ FanGraphPred, data = train_df)
  m4 <- optim(c(1, 1, 1), fn, method = c("Nelder-Mead"))
  
  ## test prediction on test dataset
  pred1 <- predict(m1, newdata = test_df)
  pred2 <- predict(m2, newdata = test_df)
  pred3 <- predict(m3, newdata = test_df)
  
  test_df_lm <- cbind(test_df,
                      BRef_lmpred = pred1,
                      BBpro_lmpred = pred2, 
                      FanGraph_lmpred = pred3)
  
  
  MSE_val[1, i] <- MSE(test_df_lm$Truth, test_df_lm$BRef_lmpred)
  MSE_val[2, i] <- MSE(test_df_lm$Truth, test_df_lm$BBpro_lmpred)
  MSE_val[3, i] <- MSE(test_df_lm$Truth, test_df_lm$FanGraph_lmpred)
  MSE_val[4, i] <- m4$value
  for(j in 1:3){
    coeffs[j, i] <- m4$par[j]
  }
}

rownames(MSE_val) <- c("Baseball Reference", "Baseball Prospectus", "Fangraph", "Combined Model")
apply(MSE_val, 1, mean)
##  Baseball Reference Baseball Prospectus            Fangraph 
##            5.130914            7.658060            5.694803 
##      Combined Model 
##            4.685790
#Obtain coefficients from combined model
final_coef <- apply(coeffs, 1, mean)

#Make predictions based on combined model
data_wide <- data_wide %>%
  mutate(CombinedPred = final_coef[1]*data_wide$BRefPred + final_coef[2]*data_wide$BBproPred + final_coef[3]*data_wide$FanGraphPred )


data_wide %>% ggplot() +
    geom_point(aes(x = Truth, y = FanGraphPred, color="Fan Graph"), alpha = 0.5) + 
    geom_point(aes(x = Truth, y = BBproPred, color="Baseball Prospectus"), alpha = 0.5) + 
    geom_point(aes(x = Truth, y = BRefPred, color="Baseball Reference"), alpha = 0.5) +
    geom_point(aes(x = Truth, y = CombinedPred, color = "Final Model"), alpha = 0.5) +
    scale_fill_manual(values = c("Fan Graph" = rgb(1, 0, 0), "Baseball Prospectus" = rgb(0, 1, 0), "Baseball Reference"=rgb(0, 0, 1), "Final Model" = rgb(0, 0, 0))) +
    geom_abline(slope = 1, intercept = 0) +
    ggtitle("True v Predicted Wins") +
    xlab("True Wins") +
    ylab("Predicted Wins") +
    xlim(40, 125) +
    ylim(40, 125)
## Warning: Removed 3 rows containing missing values (geom_point).
## Warning: Removed 8 rows containing missing values (geom_point).
## Warning: Removed 2 rows containing missing values (geom_point).

## Warning: Removed 2 rows containing missing values (geom_point).